Let’s start with reading the computed metrics for all projects.
## [1] TRUE
## 'data.frame': 2835 obs. of 19 variables:
## $ project : Factor w/ 13 levels "black","cookiecutter",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ bug_number : int 1 3 4 6 7 8 10 11 14 15 ...
## $ granularity : Factor w/ 3 levels "function","statement",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ technique : Factor w/ 7 levels "DStar","Metallaxis",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ crashing : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ predicate : logi FALSE FALSE FALSE TRUE TRUE TRUE ...
## $ ismutable : logi FALSE FALSE FALSE TRUE TRUE TRUE ...
## $ mutability : num 0 0 0 0.112 0.119 ...
## $ time : num 132.4 104.2 68.5 58.8 64.8 ...
## $ einspect : num 4 100.5 39.5 11 29 ...
## $ is_bug_localized: int 1 1 1 1 1 1 1 1 1 1 ...
## $ exam : num 0.0099 0.3018 0.1282 0.0364 0.0967 ...
## $ java_exam_score : num 0.0099 0.3018 0.1282 0.0364 0.0967 ...
## $ cdist : num NA NA NA NA NA NA NA NA NA NA ...
## $ svcomp : num NA NA NA NA NA NA NA NA NA NA ...
## $ minutes : num 2.21 1.74 1.14 0.98 1.08 ...
## $ family : Factor w/ 4 levels "MBFL","PS","ST",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ category : Factor w/ 4 levels "CL","DEV","DS",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ bugid : Factor w/ 135 levels "black1","black10",..: 1 9 10 11 12 13 2 3 4 5 ...
We have data about 135 bugs in 13 analyzed projects.
Let’s see an example of visual and statistical comparison of two groups of experiments for the same bugs.
To make the example concrete, let’s pick two groups and compare their \(E_{\text{inspect}}\) scores on statement-level fault localization:
Since there are three experiments per bug using SBFL, but only two per bug using MBFL, we’ll aggregate the scores for the same bug by averaging.
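As a sketch of this aggregation step (on toy stand-in data, since the actual code is not shown), base R’s `aggregate` does the per-bug averaging:

```r
# Toy stand-in for the real data: per-bug einspect scores from two families,
# with a different number of experiments per bug.
scores <- data.frame(
  bugid = c("b1", "b1", "b1", "b2", "b2", "b2", "b1", "b1", "b2", "b2"),
  family = c(rep("SBFL", 6), rep("MBFL", 4)),
  einspect = c(4, 6, 5, 10, 12, 11, 7, 9, 20, 22)
)
# Average the scores of experiments on the same bug, per family.
by.bug <- aggregate(einspect ~ bugid + family, data = scores, FUN = mean)
by.bug
```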
Let’s start with some visualization: a scatterplot with a point for each bug; each point has coordinates \(x, y\) where \(x\) is its score in MBFL and \(y\) its score in SBFL.
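As a minimal sketch (on hypothetical wide-format data, since the plotting code is not shown), such a scatterplot can be drawn in base R:

```r
# Hypothetical wide-format data: one row per bug, with its aggregated
# score under each family and the bug's project category.
wide <- data.frame(MBFL = c(5, 20, 80, 3),
                   SBFL = c(6, 8, 15, 3),
                   category = factor(c("DEV", "CL", "CL", "DS")))
plot(SBFL ~ MBFL, data = wide, col = wide$category, pch = 19)
abline(0, 1, lty = 2)  # the x = y reference line
```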
As you can see, there is a bulk of bugs for which SBFL performs very similarly to MBFL (points close to the \(x = y\) line). However, for several other bugs, SBFL is much better (remember that lower is better for this score).
Looking at the colors, we notice that bugs in the CL (and possibly DS) category are overrepresented among the “harder” bugs on which SBFL behaves much better than MBFL.
Analyzing the same data numerically, we can compute the correlation (Kendall’s \(\tau\)) between \(S\) and \(M\):
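This is a single call to `cor.test`; here is a self-contained sketch on synthetic stand-in data (in the actual analysis, `S` and `M` are the per-bug aggregated scores):

```r
set.seed(1)
M <- rexp(135)                   # stand-in for per-bug MBFL scores
S <- 0.6 * M + 0.4 * rexp(135)   # correlated stand-in for per-bug SBFL scores
kt <- cor.test(S, M, method = "kendall")  # Kendall's tau, z statistic, p-value
kt$estimate
kt$p.value
```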
##
## Kendall's rank correlation tau
##
## data: S and M
## z = 7.8047, p-value = 5.965e-15
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## 0.5403952
A correlation of about 0.54 is not particularly strong, but it is clearly different from zero (note the tiny p-value).
Finally, we may also perform a statistical test (Wilcoxon’s paired test) and compute a matching effect size (Cliff’s delta).
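Both statistics are straightforward to obtain: `wilcox.test` is in base R, and Cliff’s delta (provided, for example, by `effsize::cliff.delta`) can also be computed directly from its definition. A self-contained sketch on synthetic stand-in data:

```r
set.seed(1)
M <- rexp(135)   # stand-in for per-bug MBFL scores
S <- 0.8 * M     # stand-in for per-bug SBFL scores (systematically lower)

# Paired Wilcoxon signed-rank test on the per-bug scores.
wilcox.test(S, M, paired = TRUE)

# Cliff's delta from its definition: P(S > M) - P(S < M) over all pairs.
cliff_delta <- mean(sign(outer(S, M, "-")))
cliff_delta
```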
##
## Wilcoxon signed rank test with continuity correction
##
## data: S and M
## V = 1068, p-value = 0.005209
## alternative hypothesis: true location shift is not equal to 0
##
## Cliff's Delta
##
## delta estimate: -0.1761866 (small)
## 95 percent confidence interval:
## lower upper
## -0.29548868 -0.05147369
Cliff’s delta, in particular, roughly measures how much more often the values in one set are larger than the values in the other. Thus, the estimate of about \(-0.18\) means that SBFL’s \(E_{\text{inspect}}\) score is smaller than MBFL’s roughly 18 percentage points more often than it is larger.
These statistics, for what they’re worth, seem to confirm that there is a noticeable difference in favor of SBFL.
Now, let’s generalize this to a scatterplot matrix to show the relations between all possible pairs of FL families.
First, we define a bunch of helper functions.
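The helper functions are omitted here; as a rough stand-in for their result, base R’s `pairs` already produces a basic scatterplot matrix (hypothetical data; the actual plots are richer):

```r
set.seed(1)
# Hypothetical wide data: one column of per-bug scores per FL family.
wide <- data.frame(MBFL = rexp(50), PS = rexp(50), SBFL = rexp(50), ST = rexp(50))
pairs(wide)  # one scatterplot for every pair of families
```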
Then, we use them to generate plots for \(E_{\text{inspect}}\).
Now, it’s easy to compute a similar plot for other metrics. For example, running time (in minutes):
Let’s build a simple multivariate regression model,
where we predict einspect and time from the family of FL techniques and the category of projects.
First, we standardize the numeric variables (note the S suffix below), which makes it much easier to set sensible priors.
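The exact transformation is not shown; one plausible choice, consistent with the log link used below (which requires nonnegative outcomes), is to scale by the standard deviation without centering. A sketch on a few toy values:

```r
# Toy stand-in for the real data (values taken from the str() output above).
by.statement <- data.frame(einspect = c(4, 100.5, 39.5, 11, 29),
                           time = c(132.4, 104.2, 68.5, 58.8, 64.8))
# Scale without centering, so the outcomes stay nonnegative.
by.statement$einspectS <- by.statement$einspect / sd(by.statement$einspect)
by.statement$timeS <- by.statement$time / sd(by.statement$time)
```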
Here’s a basic regression model, where the only unusual aspects are that it’s multivariate, and log-transforms the mean (since both outcome variables must be nonnegative).
eq.m1 <- brmsformula(
mvbind(einspectS, timeS) ~ 0 + family + category,
family=brmsfamily("gaussian", link="log")
) + set_rescor(TRUE)
pp1.check <- get_prior(eq.m1, data=by.statement)
pp1 <- c(
set_prior("normal(0, 1.0)", class="b", resp=c("einspectS", "timeS")),
set_prior("weibull(2, 1)", class="sigma", resp=c("einspectS", "timeS"))
)
Let’s do the usual checks to make sure that everything is fine with the fitting.
Prior checks, confirming that the sampled priors span a wide range of values, amply including the data.
Now we fit the actual model.
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
##
## Chain 4 finished in 12.0 seconds.
## Chain 1 finished in 12.2 seconds.
## Chain 2 finished in 12.2 seconds.
## Chain 3 finished in 13.2 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 12.4 seconds.
## Total execution time: 13.4 seconds.
Next, we check the usual diagnostics:
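The code producing the three numbers below is not shown; a plausible way to compute them with brms (assuming the fitted model object `m1`) is the count of divergent transitions, the maximum \(\widehat{R}\), and the minimum ratio of effective sample size:

```r
# Divergent transitions during sampling (should be 0).
np <- brms::nuts_params(m1)
sum(subset(np, Parameter == "divergent__")$Value)
# Convergence: largest Rhat across all parameters (should be close to 1).
max(brms::rhat(m1), na.rm = TRUE)
# Efficiency: smallest ratio of effective sample size to total draws.
min(brms::neff_ratio(m1), na.rm = TRUE)
```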
## [1] 0
## [1] 1.004918
## [1] 0.3498535
Finally, we check the posteriors, to ensure that we have a decent approximation of the data.
As you can see, the simulated posteriors are acceptable, given that the data is complex while the model is quite simplistic (we’ll improve it soon).
## Family: MV(gaussian, gaussian)
## Links: mu = log; sigma = identity
## mu = log; sigma = identity
## Formula: einspectS ~ 0 + family + category
## timeS ~ 0 + family + category
## Data: by.statement (Number of observations: 945)
## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## einspectS_familyMBFL -2.93 0.50 -4.03 -2.06 1.00 4578
## einspectS_familyPS 0.43 0.08 0.27 0.58 1.00 4733
## einspectS_familyST 0.46 0.08 0.29 0.61 1.00 4045
## einspectS_familySBFL -3.44 0.48 -4.45 -2.57 1.00 5027
## einspectS_categoryDEV -2.63 0.48 -3.64 -1.81 1.00 4929
## einspectS_categoryDS -0.67 0.14 -0.95 -0.42 1.00 3836
## einspectS_categoryWEB -2.38 0.48 -3.43 -1.57 1.00 4807
## timeS_familyMBFL -1.39 0.20 -1.82 -1.01 1.00 1415
## timeS_familyPS -2.09 0.28 -2.70 -1.59 1.00 2049
## timeS_familyST -3.75 0.45 -4.67 -2.96 1.00 3648
## timeS_familySBFL -4.38 0.45 -5.31 -3.56 1.00 3381
## timeS_categoryDEV 0.96 0.27 0.43 1.47 1.00 1748
## timeS_categoryDS 1.49 0.21 1.08 1.91 1.00 1399
## timeS_categoryWEB -0.95 0.66 -2.41 0.15 1.00 3819
## Tail_ESS
## einspectS_familyMBFL 2861
## einspectS_familyPS 2968
## einspectS_familyST 3075
## einspectS_familySBFL 2787
## einspectS_categoryDEV 2893
## einspectS_categoryDS 3079
## einspectS_categoryWEB 2670
## timeS_familyMBFL 1864
## timeS_familyPS 2727
## timeS_familyST 2896
## timeS_familySBFL 2580
## timeS_categoryDEV 2317
## timeS_categoryDS 2123
## timeS_categoryWEB 2669
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma_einspectS 0.86 0.02 0.82 0.90 1.00 5808 3431
## sigma_timeS 0.93 0.02 0.88 0.97 1.00 4492 3001
##
## Residual Correlations:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## rescor(einspectS,timeS) 0.07 0.03 0.00 0.14 1.00 4829
## Tail_ESS
## rescor(einspectS,timeS) 2993
##
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
What’s noticeable here is that the residual correlation
between the two outcomes einspect and time is quite small (7%),
which means that there is not much of a consistent dependency
between these two variables.
Let’s set up some functions to analyze the posterior samples of \(m_1\) (and similar models).
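The summary functions themselves are not shown. Judging from their output, they report nested credible-interval bounds at several probabilities; one plausible base-R sketch of that computation, on synthetic posterior draws:

```r
set.seed(1)
# Synthetic stand-in for the posterior draws of one coefficient.
draws <- rnorm(4000, mean = -2.9, sd = 0.5)

# Lower ("|p") and upper ("p|") bounds of two-sided credible intervals
# at several probability levels (one plausible reading of the output format).
cred <- c(0.5, 0.7, 0.9, 0.95, 0.99)
lower <- quantile(draws, probs = (1 - cred) / 2)
upper <- quantile(draws, probs = 1 - (1 - cred) / 2)
rbind(lower, upper)
```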
We first use these functions to analyze the effects per family of FL techniques.
## $ints
## MBFL PS SBFL ST
## |0.5 -1.8403733 1.750277 -2.2904133 1.787965
## |0.7 -1.9987933 1.723377 -2.4933733 1.755788
## |0.9 -2.3258633 1.679919 -2.8417033 1.699362
## |0.95 -2.5844133 1.656920 -2.9724333 1.667282
## |0.99 -2.9603633 1.589496 -3.3652433 1.613864
## 0.99| -0.4593433 2.011449 -0.9055133 2.031262
## 0.95| -0.6456633 1.958796 -1.1198133 1.988802
## 0.9| -0.6952333 1.932004 -1.3104833 1.966496
## 0.7| -0.9746733 1.881454 -1.5097333 1.920313
## 0.5| -1.1869033 1.854129 -1.6453033 1.896548
##
## $est
## NULL
## $ints
## MBFL PS SBFL ST
## |0.5 1.3932608 0.65172082 -1.7136392 -1.10199918
## |0.7 1.3372008 0.54789082 -1.8752192 -1.23743918
## |0.9 1.1835608 0.36650082 -2.2369192 -1.56690918
## |0.95 1.1137808 0.26052082 -2.3766392 -1.74514918
## |0.99 0.9678708 0.01876082 -2.6434592 -2.13414918
## 0.99| 2.0093678 1.45849082 -0.3948592 0.15032082
## 0.95| 1.9086418 1.33810082 -0.6333692 -0.03348918
## 0.9| 1.8495308 1.26726082 -0.7575192 -0.13368918
## 0.7| 1.7494308 1.10962082 -0.9557192 -0.30827918
## 0.5| 1.6572508 1.01740082 -1.1247692 -0.49940918
##
## $est
## NULL
For both outcomes, e_inspect and time,
there are clear differences (with high probability)
in the contributions to the mean from the different families of techniques.
Looking at the effects by category of project does not
yield as strong differences, but we can see that DS projects tend to be associated with worse (higher) e_inspect scores and with longer running times,
and that WEB projects are associated with
shorter running times.
## $ints
## DEV DS WEB
## |0.5 -1.0046079 1.1276011 -0.7240479
## |0.7 -1.1353879 1.1107671 -0.9001379
## |0.9 -1.4761379 0.9893531 -1.2261379
## |0.95 -1.6771979 0.9489131 -1.4957179
## |0.99 -2.0851179 0.8482821 -1.8106379
## 0.99| 0.3128821 1.5513121 0.5550221
## 0.95| 0.1259921 1.4800181 0.3450021
## 0.9| 0.0804521 1.4342721 0.3041321
## 0.7| -0.1538179 1.3848411 0.0720821
## 0.5| -0.3640279 1.3044731 -0.0804679
##
## $est
## NULL
## $ints
## DEV DS WEB
## |0.5 0.27467305 0.845127 -1.681273
## |0.7 0.19445705 0.768187 -2.001653
## |0.9 0.03318005 0.625197 -2.491093
## |0.95 -0.04185295 0.585337 -2.839633
## |0.99 -0.24780795 0.456786 -3.446723
## 0.99| 1.16100705 1.535847 -0.122691
## 0.95| 0.99543705 1.413317 -0.307852
## 0.9| 0.88579705 1.310127 -0.358522
## 0.7| 0.73274705 1.196797 -0.681840
## 0.5| 0.62126705 1.125677 -0.834484
##
## $est
## NULL
Let’s make the model more sophisticated: we introduce varying effects, and we model these effects as possibly correlated (which makes sense, since we have two model parts).
eq.m2 <- brmsformula(
mvbind(einspectS, timeS) ~ 1 + (1|p|family) + (1|q|category),
family=brmsfamily("gaussian", link="log")
) + set_rescor(TRUE)
pp2.check <- get_prior(eq.m2, data=by.statement)
pp2 <- c(
set_prior("normal(0, 1.0)", class="Intercept", resp=c("einspectS", "timeS")),
set_prior("weibull(2, 0.3)", class="sd", coef="Intercept",
group="family", resp=c("einspectS", "timeS")),
set_prior("weibull(2, 0.3)", class="sd", coef="Intercept",
group="category", resp=c("einspectS", "timeS")),
set_prior("gamma(0.01, 0.01)", class="sigma", resp=c("einspectS", "timeS"))
)
Let’s fit \(m_2\) and check the fit.
Prior checks:
We fit model \(m_2\).
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
##
## Chain 2 finished in 76.4 seconds.
## Chain 4 finished in 77.2 seconds.
## Chain 1 finished in 78.0 seconds.
## Chain 3 finished in 82.7 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 78.6 seconds.
## Total execution time: 82.8 seconds.
Diagnostics:
## [1] 0
## [1] 1.004241
## [1] 0.3730994
Posterior checks:
We don’t notice a clear improvement compared to \(m_1\). Let’s compare the two models using LOO.
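In brms, this comparison boils down to computing PSIS-LOO for both models and calling `loo_compare` (a sketch, assuming the fitted model objects `m1` and `m2`):

```r
# Compute PSIS-LOO for each model and compare the expected
# log pointwise predictive density (elpd).
loo1 <- brms::loo(m1)
loo2 <- brms::loo(m2)
brms::loo_compare(loo1, loo2)  # best model on top, elpd differences below it
```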
## Output of model 'm1':
##
## Computed from 4000 by 945 log-likelihood matrix
##
## Estimate SE
## elpd_loo -2485.8 97.2
## p_loo 46.2 6.8
## looic 4971.7 194.3
## ------
## Monte Carlo SE of elpd_loo is NA.
##
## Pareto k diagnostic values:
## Count Pct. Min. n_eff
## (-Inf, 0.5] (good) 942 99.7% 749
## (0.5, 0.7] (ok) 2 0.2% 415
## (0.7, 1] (bad) 1 0.1% 104
## (1, Inf) (very bad) 0 0.0% <NA>
## See help('pareto-k-diagnostic') for details.
##
## Output of model 'm2':
##
## Computed from 4000 by 945 log-likelihood matrix
##
## Estimate SE
## elpd_loo -2479.7 98.4
## p_loo 47.6 7.1
## looic 4959.4 196.9
## ------
## Monte Carlo SE of elpd_loo is 0.2.
##
## Pareto k diagnostic values:
## Count Pct. Min. n_eff
## (-Inf, 0.5] (good) 944 99.9% 346
## (0.5, 0.7] (ok) 1 0.1% 163
## (0.7, 1] (bad) 0 0.0% <NA>
## (1, Inf) (very bad) 0 0.0% <NA>
##
## All Pareto k estimates are ok (k < 0.7).
## See help('pareto-k-diagnostic') for details.
##
## Model comparisons:
## elpd_diff se_diff
## m2 0.0 0.0
## m1 -6.2 2.1
\(m_1\)’s score is more than 2.9 standard deviations worse than \(m_2\)’s: a substantial difference in favor of \(m_2\) in terms of predictive capabilities.
## Family: MV(gaussian, gaussian)
## Links: mu = log; sigma = identity
## mu = log; sigma = identity
## Formula: einspectS ~ 1 + (1 | p | family) + (1 | q | category)
## timeS ~ 1 + (1 | p | family) + (1 | q | category)
## Data: by.statement (Number of observations: 945)
## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 4000
##
## Group-Level Effects:
## ~category (Number of levels: 4)
## Estimate Est.Error l-95% CI u-95% CI
## sd(einspectS_Intercept) 0.67 0.12 0.46 0.93
## sd(timeS_Intercept) 0.65 0.12 0.43 0.91
## cor(einspectS_Intercept,timeS_Intercept) -0.11 0.24 -0.54 0.36
## Rhat Bulk_ESS Tail_ESS
## sd(einspectS_Intercept) 1.00 3839 3353
## sd(timeS_Intercept) 1.00 3874 2820
## cor(einspectS_Intercept,timeS_Intercept) 1.00 2905 2639
##
## ~family (Number of levels: 4)
## Estimate Est.Error l-95% CI u-95% CI
## sd(einspectS_Intercept) 0.90 0.12 0.69 1.14
## sd(timeS_Intercept) 0.76 0.12 0.54 1.02
## cor(einspectS_Intercept,timeS_Intercept) 0.03 0.20 -0.37 0.40
## Rhat Bulk_ESS Tail_ESS
## sd(einspectS_Intercept) 1.00 3466 2726
## sd(timeS_Intercept) 1.00 3558 2512
## cor(einspectS_Intercept,timeS_Intercept) 1.00 2932 2497
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## einspectS_Intercept -2.11 0.52 -3.15 -1.11 1.00 1983 2230
## timeS_Intercept -2.24 0.49 -3.20 -1.27 1.00 2504 2741
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma_einspectS 0.86 0.02 0.82 0.90 1.00 5117 3372
## sigma_timeS 0.92 0.02 0.88 0.96 1.00 4825 2688
##
## Residual Correlations:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## rescor(einspectS,timeS) 0.06 0.03 -0.00 0.13 1.00 4792
## Tail_ESS
## rescor(einspectS,timeS) 2982
##
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
By category, there is a slight inverse correlation between
the two outcomes einspect and time; this correlation disappears
in the family terms.
The residual correlation is even a bit lower than in \(m_1\).
Let’s now perform an effects analysis
on the fitted coefficients of \(m_2\).
First we introduce a summary function suitable for varying effects models.
Then we use the summary function to analyze the effects of the FL techniques.
## $ints
## MBFL PS ST SBFL
## |0.5 -2.4606500 1.1081400 1.1413275 -2.949263
## |0.7 -2.7027925 0.9450821 0.9713919 -3.200388
## |0.9 -3.1127090 0.6570809 0.6771380 -3.654665
## |0.95 -3.3454335 0.5066452 0.5340358 -3.930798
## |0.99 -3.8695932 0.2231197 0.2725894 -4.369491
## 0.99| -0.6635308 2.5672021 2.6205916 -1.047415
## 0.95| -0.9485867 2.2916865 2.3202460 -1.387823
## 0.9| -1.0984565 2.1301105 2.1644880 -1.559788
## 0.7| -1.4229790 1.8533755 1.8831670 -1.912237
## 0.5| -1.6262350 1.6947850 1.7288225 -2.118677
##
## $est
## MBFL PS ST SBFL
## -2.060815 1.400548 1.433388 -2.548815
## $ints
## MBFL PS ST SBFL
## |0.5 1.0946800 0.22141375 -1.70961000 -2.2593425
## |0.7 0.9465640 0.06355268 -1.94318600 -2.4973660
## |0.9 0.6893988 -0.23781360 -2.42553900 -2.9487880
## |0.95 0.5458556 -0.40320322 -2.64475950 -3.2187678
## |0.99 0.2741525 -0.74160358 -3.19923510 -3.7255032
## 0.99| 2.5363547 1.75528050 -0.09891588 -0.5405140
## 0.95| 2.2093073 1.39546375 -0.34315037 -0.7979575
## 0.9| 2.0756000 1.24684350 -0.47696785 -0.9458480
## 0.7| 1.7981860 0.98256800 -0.76741045 -1.2691605
## 0.5| 1.6436150 0.82162225 -0.95718375 -1.4560650
##
## $est
## MBFL PS ST SBFL
## 1.3723869 0.5184232 -1.3600563 -1.8793254
The results are generally consistent with those of model \(m_1\), although some effects slightly weaken or strengthen.
Let’s see what happens for the categories of projects.
## $ints
## CL DEV DS WEB
## |0.5 0.8973230 -1.6409625 0.19578975 -1.4068925
## |0.7 0.7781830 -1.8274915 0.06527855 -1.6149360
## |0.9 0.5577576 -2.2052000 -0.16398785 -1.9760895
## |0.95 0.4281495 -2.3890668 -0.27913252 -2.1481097
## |0.99 0.2105576 -2.7207191 -0.50242412 -2.5928487
## 0.99| 2.0780029 -0.2655134 1.41146000 0.1100508
## 0.95| 1.8353168 -0.4866055 1.14676950 -0.1986511
## 0.9| 1.7212210 -0.5962981 1.02549050 -0.3344486
## 0.7| 1.4851615 -0.8451007 0.80029395 -0.5932893
## 0.5| 1.3611725 -1.0031125 0.66661650 -0.7507050
##
## $est
## CL DEV DS WEB
## 1.1290573 -1.3369545 0.4319761 -1.0976749
## $ints
## CL DEV DS WEB
## |0.5 -1.7739200 0.20136575 0.74401400 -1.13223000
## |0.7 -1.9769675 0.06991984 0.61347920 -1.34135950
## |0.9 -2.3588815 -0.17846145 0.39368695 -1.70685700
## |0.95 -2.5783037 -0.29627345 0.27697138 -1.89915975
## |0.99 -2.9636419 -0.54494419 0.03735386 -2.40549125
## 0.99| -0.3200830 1.42128610 1.91708435 0.30014064
## 0.95| -0.6354105 1.17446125 1.69174025 0.09446047
## 0.9| -0.7477540 1.04970900 1.56658950 -0.06868987
## 0.7| -0.9911982 0.82799765 1.33784450 -0.32573750
## 0.5| -1.1411575 0.69177800 1.20483000 -0.47806200
##
## $est
## CL DEV DS WEB
## -1.4791651 0.4460166 0.9762418 -0.8246671
Here we see some differences, which may partly be due to the fact that \(m_2\) models the different categories more uniformly.
Now, let’s try a variant of \(m_2\) where we go back to
fixed intercepts but add an interaction term between family of FL techniques
and category of projects.
eq.m3 <- brmsformula(
mvbind(einspectS, timeS) ~
0 + family + category + (0 + family|r|category),
family=brmsfamily("gaussian", link="log")
) + set_rescor(TRUE)
pp3.check <- get_prior(eq.m3, data=by.statement)
pp3 <- c(
set_prior("normal(0, 1.0)", class="b", resp=c("einspectS", "timeS")),
set_prior("gamma(0.01, 0.01)", class="sigma", resp=c("einspectS", "timeS")),
set_prior("lkj(1)", class="cor"),
set_prior("weibull(2, 0.3)", class="sd", resp=c("einspectS", "timeS"))
)
Let’s fit \(m_3\) and check the fit.
Prior checks:
We fit model \(m_3\).
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
##
## Chain 4 finished in 141.8 seconds.
## Chain 2 finished in 142.7 seconds.
## Chain 3 finished in 144.9 seconds.
## Chain 1 finished in 152.1 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 145.4 seconds.
## Total execution time: 152.3 seconds.
Diagnostics:
## [1] 0
## [1] 1.003471
## [1] 0.2959395
Posterior checks:
In line with what we have seen before, possibly a bit better.
Model comparison:
## Output of model 'm1':
##
## Computed from 4000 by 945 log-likelihood matrix
##
## Estimate SE
## elpd_loo -2485.8 97.2
## p_loo 46.2 6.8
## looic 4971.7 194.3
## ------
## Monte Carlo SE of elpd_loo is NA.
##
## Pareto k diagnostic values:
## Count Pct. Min. n_eff
## (-Inf, 0.5] (good) 942 99.7% 749
## (0.5, 0.7] (ok) 2 0.2% 415
## (0.7, 1] (bad) 1 0.1% 104
## (1, Inf) (very bad) 0 0.0% <NA>
## See help('pareto-k-diagnostic') for details.
##
## Output of model 'm2':
##
## Computed from 4000 by 945 log-likelihood matrix
##
## Estimate SE
## elpd_loo -2479.7 98.4
## p_loo 47.6 7.1
## looic 4959.4 196.9
## ------
## Monte Carlo SE of elpd_loo is 0.2.
##
## Pareto k diagnostic values:
## Count Pct. Min. n_eff
## (-Inf, 0.5] (good) 944 99.9% 346
## (0.5, 0.7] (ok) 1 0.1% 163
## (0.7, 1] (bad) 0 0.0% <NA>
## (1, Inf) (very bad) 0 0.0% <NA>
##
## All Pareto k estimates are ok (k < 0.7).
## See help('pareto-k-diagnostic') for details.
##
## Output of model 'm3':
##
## Computed from 4000 by 945 log-likelihood matrix
##
## Estimate SE
## elpd_loo -2476.5 97.4
## p_loo 53.5 8.1
## looic 4953.0 194.9
## ------
## Monte Carlo SE of elpd_loo is 0.2.
##
## Pareto k diagnostic values:
## Count Pct. Min. n_eff
## (-Inf, 0.5] (good) 942 99.7% 628
## (0.5, 0.7] (ok) 3 0.3% 127
## (0.7, 1] (bad) 0 0.0% <NA>
## (1, Inf) (very bad) 0 0.0% <NA>
##
## All Pareto k estimates are ok (k < 0.7).
## See help('pareto-k-diagnostic') for details.
##
## Model comparisons:
## elpd_diff se_diff
## m3 0.0 0.0
## m2 -3.2 6.5
## m1 -9.3 6.3
\(m_2\)’s score is less than half a standard deviation worse than \(m_3\)’s. This is a negligible improvement, not worth the additional complexity of model \(m_3\). Thus, we stick with \(m_2\) as our selected model.
## Family: MV(gaussian, gaussian)
## Links: mu = log; sigma = identity
## mu = log; sigma = identity
## Formula: einspectS ~ 0 + family + category + (0 + family | r | category)
## timeS ~ 0 + family + category + (0 + family | r | category)
## Data: by.statement (Number of observations: 945)
## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 4000
##
## Group-Level Effects:
## ~category (Number of levels: 4)
## Estimate Est.Error l-95% CI
## sd(einspectS_familyMBFL) 0.28 0.15 0.05
## sd(einspectS_familyPS) 0.27 0.14 0.05
## sd(einspectS_familyST) 0.33 0.14 0.09
## sd(einspectS_familySBFL) 0.29 0.15 0.05
## sd(timeS_familyMBFL) 0.52 0.14 0.26
## sd(timeS_familyPS) 0.39 0.16 0.10
## sd(timeS_familyST) 0.27 0.14 0.05
## sd(timeS_familySBFL) 0.28 0.15 0.05
## cor(einspectS_familyMBFL,einspectS_familyPS) -0.01 0.33 -0.61
## cor(einspectS_familyMBFL,einspectS_familyST) -0.05 0.33 -0.68
## cor(einspectS_familyPS,einspectS_familyST) 0.01 0.32 -0.60
## cor(einspectS_familyMBFL,einspectS_familySBFL) 0.05 0.33 -0.60
## cor(einspectS_familyPS,einspectS_familySBFL) -0.01 0.34 -0.64
## cor(einspectS_familyST,einspectS_familySBFL) -0.05 0.34 -0.68
## cor(einspectS_familyMBFL,timeS_familyMBFL) 0.06 0.33 -0.59
## cor(einspectS_familyPS,timeS_familyMBFL) 0.10 0.30 -0.51
## cor(einspectS_familyST,timeS_familyMBFL) -0.25 0.28 -0.74
## cor(einspectS_familySBFL,timeS_familyMBFL) 0.07 0.32 -0.57
## cor(einspectS_familyMBFL,timeS_familyPS) 0.07 0.33 -0.59
## cor(einspectS_familyPS,timeS_familyPS) 0.00 0.32 -0.60
## cor(einspectS_familyST,timeS_familyPS) -0.18 0.31 -0.73
## cor(einspectS_familySBFL,timeS_familyPS) 0.08 0.33 -0.57
## cor(timeS_familyMBFL,timeS_familyPS) 0.27 0.29 -0.36
## cor(einspectS_familyMBFL,timeS_familyST) 0.03 0.33 -0.62
## cor(einspectS_familyPS,timeS_familyST) -0.01 0.33 -0.63
## cor(einspectS_familyST,timeS_familyST) -0.01 0.33 -0.64
## cor(einspectS_familySBFL,timeS_familyST) 0.03 0.33 -0.61
## cor(timeS_familyMBFL,timeS_familyST) 0.00 0.32 -0.61
## cor(timeS_familyPS,timeS_familyST) 0.02 0.34 -0.63
## cor(einspectS_familyMBFL,timeS_familySBFL) 0.04 0.33 -0.60
## cor(einspectS_familyPS,timeS_familySBFL) -0.02 0.33 -0.63
## cor(einspectS_familyST,timeS_familySBFL) -0.02 0.34 -0.65
## cor(einspectS_familySBFL,timeS_familySBFL) 0.03 0.33 -0.61
## cor(timeS_familyMBFL,timeS_familySBFL) 0.02 0.33 -0.63
## cor(timeS_familyPS,timeS_familySBFL) 0.04 0.33 -0.60
## cor(timeS_familyST,timeS_familySBFL) 0.04 0.34 -0.61
## u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(einspectS_familyMBFL) 0.61 1.00 3603 1884
## sd(einspectS_familyPS) 0.58 1.00 2415 2138
## sd(einspectS_familyST) 0.62 1.00 2216 1934
## sd(einspectS_familySBFL) 0.63 1.00 3985 2253
## sd(timeS_familyMBFL) 0.81 1.00 3409 2193
## sd(timeS_familyPS) 0.71 1.00 2189 1669
## sd(timeS_familyST) 0.60 1.00 4092 2314
## sd(timeS_familySBFL) 0.61 1.00 3464 1805
## cor(einspectS_familyMBFL,einspectS_familyPS) 0.63 1.00 4017 2761
## cor(einspectS_familyMBFL,einspectS_familyST) 0.59 1.00 3976 2796
## cor(einspectS_familyPS,einspectS_familyST) 0.59 1.00 3409 2945
## cor(einspectS_familyMBFL,einspectS_familySBFL) 0.65 1.00 5144 2572
## cor(einspectS_familyPS,einspectS_familySBFL) 0.64 1.00 4443 3294
## cor(einspectS_familyST,einspectS_familySBFL) 0.63 1.00 4031 2913
## cor(einspectS_familyMBFL,timeS_familyMBFL) 0.66 1.00 2433 2691
## cor(einspectS_familyPS,timeS_familyMBFL) 0.66 1.00 2464 3028
## cor(einspectS_familyST,timeS_familyMBFL) 0.34 1.00 2956 2790
## cor(einspectS_familySBFL,timeS_familyMBFL) 0.67 1.00 2731 3241
## cor(einspectS_familyMBFL,timeS_familyPS) 0.68 1.00 3831 2839
## cor(einspectS_familyPS,timeS_familyPS) 0.62 1.00 3043 3004
## cor(einspectS_familyST,timeS_familyPS) 0.45 1.00 3072 2988
## cor(einspectS_familySBFL,timeS_familyPS) 0.68 1.00 3426 3180
## cor(timeS_familyMBFL,timeS_familyPS) 0.76 1.00 2752 3005
## cor(einspectS_familyMBFL,timeS_familyST) 0.65 1.00 5628 2490
## cor(einspectS_familyPS,timeS_familyST) 0.62 1.00 4997 2858
## cor(einspectS_familyST,timeS_familyST) 0.61 1.00 3987 3302
## cor(einspectS_familySBFL,timeS_familyST) 0.63 1.00 3228 2694
## cor(timeS_familyMBFL,timeS_familyST) 0.62 1.00 3965 3682
## cor(timeS_familyPS,timeS_familyST) 0.65 1.00 3157 3360
## cor(einspectS_familyMBFL,timeS_familySBFL) 0.65 1.00 5670 3219
## cor(einspectS_familyPS,timeS_familySBFL) 0.63 1.00 4610 2988
## cor(einspectS_familyST,timeS_familySBFL) 0.62 1.00 4241 3430
## cor(einspectS_familySBFL,timeS_familySBFL) 0.66 1.00 3017 2579
## cor(timeS_familyMBFL,timeS_familySBFL) 0.63 1.00 4248 3320
## cor(timeS_familyPS,timeS_familySBFL) 0.66 1.00 3117 3237
## cor(timeS_familyST,timeS_familySBFL) 0.66 1.00 2204 2885
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## einspectS_familyMBFL -2.92 0.55 -4.08 -1.95 1.00 3862
## einspectS_familyPS 0.36 0.28 -0.23 0.88 1.00 1642
## einspectS_familyST 0.07 0.34 -0.68 0.63 1.00 1433
## einspectS_familySBFL -3.37 0.53 -4.47 -2.42 1.00 3997
## einspectS_categoryDEV -2.43 0.54 -3.57 -1.41 1.00 3525
## einspectS_categoryDS -0.55 0.35 -1.28 0.11 1.00 1184
## einspectS_categoryWEB -2.23 0.56 -3.40 -1.22 1.00 3344
## timeS_familyMBFL -0.97 0.42 -1.79 -0.15 1.00 1906
## timeS_familyPS -1.39 0.42 -2.22 -0.54 1.00 1544
## timeS_familyST -3.17 0.54 -4.29 -2.14 1.00 3349
## timeS_familySBFL -3.79 0.52 -4.84 -2.80 1.00 3302
## timeS_categoryDEV 0.34 0.50 -0.69 1.26 1.00 1652
## timeS_categoryDS 0.42 0.49 -0.60 1.32 1.00 1748
## timeS_categoryWEB -1.30 0.66 -2.70 -0.14 1.00 3724
## Tail_ESS
## einspectS_familyMBFL 2896
## einspectS_familyPS 2079
## einspectS_familyST 2578
## einspectS_familySBFL 2723
## einspectS_categoryDEV 3166
## einspectS_categoryDS 2246
## einspectS_categoryWEB 2102
## timeS_familyMBFL 2045
## timeS_familyPS 2719
## timeS_familyST 2482
## timeS_familySBFL 3036
## timeS_categoryDEV 2640
## timeS_categoryDS 2138
## timeS_categoryWEB 2595
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma_einspectS 0.85 0.02 0.81 0.89 1.00 7802 2940
## sigma_timeS 0.92 0.02 0.88 0.97 1.00 5768 3152
##
## Residual Correlations:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## rescor(einspectS,timeS) 0.06 0.03 -0.00 0.13 1.00 6486
## Tail_ESS
## rescor(einspectS,timeS) 2824
##
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
Instead of considering the fixed and varying effects of \(m_3\), we can estimate the marginal means for each family of FL techniques (results omitted for brevity, since we’ll focus on \(m_2\) anyway).
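One way to estimate such marginal means is with the emmeans package, which supports brms models; a sketch (assuming the fitted model object `m3`; in a multivariate model, the `resp` argument selects the outcome):

```r
library(emmeans)
# Posterior marginal means of the einspectS outcome, one per FL family,
# averaged over the project categories.
emmeans(m3, ~ family, resp = "einspectS")
```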
Let’s now add predictors to \(m_2\), so as to study any effects of the kinds of bugs:

- predicate is a Boolean value that identifies predicate-related bugs
- crashing is a Boolean value that identifies crashing bugs
- mutability is a nonnegative score that denotes the percentage of mutants that mutate a line in a bug’s ground truth
- ismutable is a Boolean that identifies the bugs with a positive mutability score

Since mutability/ismutable likely affect both category
and einspect, it makes sense to add these predictors,
so as to close the possible backdoor path \(\textrm{category} \leftarrow \textrm{mutable} \rightarrow \textrm{einspect}\).
We are only interested in controlling for bug kind for einspect;
thus, we switch to a univariate model where einspect is the only outcome variable.
eq.m4.einspect <- brmsformula(einspectS ~ 1
+ (1|p|family) + (1|q|category)
+ predicate*family
+ crashing*family
+ ismutable*family,
family=brmsfamily("gaussian", link="log"))
eq.m4 <- eq.m4.einspect
pp4.check <- get_prior(eq.m4, data=by.statement)
pp4 <- c(
set_prior("normal(0, 1.0)", class="Intercept"),
set_prior("normal(0, 1.0)", class="b"),
set_prior("weibull(2, 0.3)", class="sd", coef="Intercept",
group="family"),
set_prior("weibull(2, 0.3)", class="sd", coef="Intercept",
group="category"),
set_prior("gamma(0.01, 0.01)", class="sigma")
)
Prior checks:
We fit model \(m_4\).
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
##
## Chain 4 finished in 29.8 seconds.
## Chain 2 finished in 30.9 seconds.
## Chain 3 finished in 32.0 seconds.
## Chain 1 finished in 38.2 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 32.7 seconds.
## Total execution time: 38.3 seconds.
Diagnostics:
## [1] 0
## [1] 1.003327
## [1] 0.3383162
Posterior checks:
Since \(m_4\) uses less data than the previous models
(it doesn’t consider the outcome time), we cannot compare it to
the other models using LOO (or any information criterion, for that matter).
## Family: gaussian
## Links: mu = log; sigma = identity
## Formula: einspectS ~ 1 + (1 | p | family) + (1 | q | category) + predicate * family + crashing * family + ismutable * family
## Data: by.statement (Number of observations: 945)
## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 4000
##
## Group-Level Effects:
## ~category (Number of levels: 4)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 0.70 0.12 0.48 0.95 1.00 4044 3090
##
## ~family (Number of levels: 4)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 0.46 0.20 0.10 0.85 1.00 1799 1862
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## Intercept -2.67 0.65 -3.88 -1.36 1.00 2428
## predicateTRUE -0.46 0.50 -1.45 0.50 1.00 2012
## familyPS 1.62 0.68 0.22 2.87 1.00 2388
## familyST 1.89 0.69 0.38 3.12 1.00 2314
## familySBFL -1.20 0.70 -2.57 0.17 1.00 4081
## crashingTRUE -1.15 0.52 -2.18 -0.15 1.00 2799
## ismutableTRUE -0.25 0.48 -1.16 0.69 1.00 1965
## predicateTRUE:familyPS -0.07 0.52 -1.06 0.94 1.00 2010
## predicateTRUE:familyST 0.48 0.50 -0.49 1.48 1.00 1980
## predicateTRUE:familySBFL -0.02 0.86 -1.73 1.66 1.00 5036
## familyPS:crashingTRUE 1.01 0.54 -0.02 2.06 1.00 2822
## familyST:crashingTRUE -1.82 0.65 -3.12 -0.57 1.00 3334
## familySBFL:crashingTRUE 0.03 0.87 -1.67 1.67 1.00 4750
## familyPS:ismutableTRUE 1.01 0.49 0.02 1.96 1.00 2148
## familyST:ismutableTRUE 0.87 0.49 -0.12 1.81 1.00 1914
## familySBFL:ismutableTRUE -0.29 0.81 -1.84 1.31 1.00 4679
## Tail_ESS
## Intercept 2629
## predicateTRUE 2368
## familyPS 2632
## familyST 2882
## familySBFL 2796
## crashingTRUE 2913
## ismutableTRUE 2223
## predicateTRUE:familyPS 2540
## predicateTRUE:familyST 2460
## predicateTRUE:familySBFL 2962
## familyPS:crashingTRUE 2868
## familyST:crashingTRUE 2923
## familySBFL:crashingTRUE 2746
## familyPS:ismutableTRUE 2617
## familyST:ismutableTRUE 2329
## familySBFL:ismutableTRUE 2819
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 0.78 0.02 0.75 0.82 1.00 6433 2648
##
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
Let’s now perform an effects analysis on the fitted coefficients of \(m_4\). Specifically, we look at the (fixed) effects of the families associated with certain categories of bugs, for response einspect.
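The interval tables below report, for each effect, the endpoints of one-sided credible intervals at several levels (rows `|0.5` through `|0.99` for lower endpoints, `0.99|` through `0.5|` for upper endpoints) together with a point estimate. Here is a minimal sketch of how such a table could be built from a vector of posterior draws; the helper name is ours, and simulated draws stand in for draws that would be extracted from the fit with, e.g., `brms::as_draws_df(m4)`:

```r
# Sketch: one-sided credible-interval endpoints at several levels,
# plus a point estimate, from a vector of posterior draws.
onesided.ints <- function(draws, levels = c(0.5, 0.7, 0.9, 0.95, 0.99)) {
  lower <- quantile(draws, probs = 1 - levels)  # "|level" rows: lower endpoints
  names(lower) <- paste0("|", levels)
  upper <- quantile(draws, probs = rev(levels)) # "level|" rows: upper endpoints
  names(upper) <- paste0(rev(levels), "|")
  list(ints = c(lower, upper), est = mean(draws))
}

# Illustration on simulated draws (stand-in values only):
set.seed(1)
onesided.ints(rnorm(4000, mean = -1.82, sd = 0.65))
```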
## $ints
## crashing MBFL crashing PS crashing ST crashing SBFL
## |0.5 -1.4957225 0.64172875 -2.2535900 -0.5706570
## |0.7 -1.6966420 0.45311365 -2.5091630 -0.8849482
## |0.9 -2.0240975 0.14002575 -2.8906505 -1.4130940
## |0.95 -2.1821472 -0.01729245 -3.1170247 -1.6737832
## |0.99 -2.5282891 -0.35063686 -3.5874488 -2.1767845
## 0.99| 0.1074132 2.44609395 -0.2224062 2.1764674
## 0.95| -0.1510499 2.05582775 -0.5708796 1.6714243
## 0.9| -0.3032274 1.89343600 -0.7896973 1.4436930
## 0.7| -0.6125321 1.57059350 -1.1579595 0.9424029
## 0.5| -0.7958315 1.37090000 -1.3684550 0.6420823
##
## $est
## crashing MBFL crashing PS crashing ST crashing SBFL
## -1.15095905 1.00587635 -1.82331692 0.02942783
## $ints
## predicate MBFL predicate PS predicate ST predicate SBFL
## |0.5 -0.80539225 -0.4252112 0.14267150 -0.5879013
## |0.7 -0.97924005 -0.5941629 -0.04137106 -0.8908687
## |0.9 -1.27643350 -0.8829843 -0.30878825 -1.4396120
## |0.95 -1.44913900 -1.0634027 -0.49199142 -1.7304670
## |0.99 -1.85374910 -1.3837009 -0.76560686 -2.2351072
## 0.99| 0.81469355 1.3561932 1.89115115 2.0862540
## 0.95| 0.50288860 0.9428116 1.47530875 1.6589353
## 0.9| 0.33490690 0.7920933 1.31585700 1.3988270
## 0.7| 0.05072069 0.4752352 1.00939850 0.8711154
## 0.5| -0.12628675 0.2727390 0.81871150 0.5510417
##
## $est
## predicate MBFL predicate PS predicate ST predicate SBFL
## -0.46324267 -0.07029723 0.48350202 -0.02474266
## $ints
## ismutable MBFL ismutable PS ismutable ST ismutable SBFL
## |0.5 -0.56463550 0.67018100 0.54703600 -0.8342113
## |0.7 -0.74068020 0.50529050 0.35888690 -1.1446125
## |0.9 -1.02027550 0.20357500 0.05512335 -1.6622670
## |0.95 -1.16404750 0.01718417 -0.12405297 -1.8383202
## |0.99 -1.46524645 -0.30899653 -0.47210512 -2.3507601
## 0.99| 1.03952155 2.31925060 2.11852570 1.6988942
## 0.95| 0.69069932 1.95853000 1.81475750 1.3090160
## 0.9| 0.52912870 1.82134500 1.66444000 1.0380830
## 0.7| 0.24578715 1.51742650 1.36737800 0.5435591
## 0.5| 0.07147647 1.33092500 1.19420250 0.2685713
##
## $est
## ismutable MBFL ismutable PS ismutable ST ismutable SBFL
## -0.2487351 1.0068292 0.8698194 -0.2899824
So, crashing bugs are indeed easier for ST. In contrast, predicate-related bugs do not seem to be simpler for PS. For mutability-related bugs, we don’t find any consistent association. Thus, let’s add to the model a finer-grained dependency on mutability, rather than just the boolean indicator ismutable. A simple way is to introduce an interaction mutability\(\times\)family.
eq.m5.einspect <- brmsformula(einspectS ~ 1
+ (1|p|family) + (1|q|category)
+ predicate*family
+ crashing*family
+ mutability*family,
family=brmsfamily("gaussian", link="log"))
eq.m5 <- eq.m5.einspect
pp5.check <- get_prior(eq.m5, data=by.statement)
pp5 <- c(
set_prior("normal(0, 1.0)", class="Intercept"),
set_prior("normal(0, 1.0)", class="b"),
set_prior("weibull(2, 0.3)", class="sd", coef="Intercept",
group="family"),
set_prior("weibull(2, 0.3)", class="sd", coef="Intercept",
group="category"),
set_prior("gamma(0.01, 0.01)", class="sigma")
)
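With the priors in place, they can be exercised on their own via a prior predictive check before fitting; here is a sketch assuming the standard brms workflow (the object name `m5.prior` and the `cmdstanr` backend are our assumptions, not the paper’s actual code):

```r
# Sketch: sample from the priors only, ignoring the likelihood,
# then compare prior-simulated responses against the observed ones.
m5.prior <- brm(eq.m5, data = by.statement, prior = pp5,
                sample_prior = "only", backend = "cmdstanr")
pp_check(m5.prior, ndraws = 50)
```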
We could get passable (not great) prior checks, but let’s cut to the chase and fit model \(m_5\).
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
##
## Chain 1 finished in 3.0 seconds.
## Chain 3 finished in 3.1 seconds.
## Chain 4 finished in 98.2 seconds.
## Chain 2 finished in 117.0 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 55.3 seconds.
## Total execution time: 117.6 seconds.
## Warning: 438 of 4000 (11.0%) transitions hit the maximum treedepth limit of 10.
## See https://mc-stan.org/misc/warnings for details.
## Warning: 2 of 4 chains have a NaN E-BFMI.
## See https://mc-stan.org/misc/warnings for details.
The first thing we notice is that two of the four chains terminated suspiciously quickly, whereas the other two went awry and spun for much longer. In addition, we got a number of scary warnings. This points to some region of the posterior that could not be sampled effectively.
Let’s see the diagnostics:
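The three numbers that follow are consistent with the usual trio of HMC diagnostics: number of divergent transitions, worst split-\(\widehat{R}\), and worst ratio of effective sample size. A sketch of how they could be obtained, assuming the fitted model object is called `m5`:

```r
# Sketch: standard convergence diagnostics for a brms fit.
np <- nuts_params(m5)
sum(subset(np, Parameter == "divergent__")$Value)  # divergent transitions
max(rhat(m5), na.rm = TRUE)                        # worst split-Rhat
min(neff_ratio(m5), na.rm = TRUE)                  # worst ESS ratio
```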
## [1] 0
## [1] 5.472564
## [1] 0.001008065
A disaster. Let’s also look at the trace plots.
Two chains are straight lines, and hence did not mix at all with the others!
Notice that the distribution of mutability is very skewed, which explains the difficulties in fitting \(m_5\). The most straightforward way out of this ditch is simply to log-transform mutability (after adding 1 to all percentages, so that all logs are defined).
by.statement$logmutability <- log(1 + by.statement$mutability)
eq.m6.einspect <- brmsformula(einspectS ~ 1
+ (1|p|family) + (1|q|category)
+ predicate*family
+ crashing*family
+ logmutability*family,
family=brmsfamily("gaussian", link="log"))
eq.m6 <- eq.m6.einspect
pp6.check <- get_prior(eq.m6, data=by.statement)
pp6 <- c(
set_prior("normal(0, 1.0)", class="Intercept"),
set_prior("normal(0, 1.0)", class="b"),
set_prior("weibull(2, 0.3)", class="sd", coef="Intercept",
group="family"),
set_prior("weibull(2, 0.3)", class="sd", coef="Intercept",
group="category"),
set_prior("gamma(0.01, 0.01)", class="sigma")
)
Alternative ways to modify \(m_5\) so that it can be analyzed (which we mention but don’t further explore here): introducing a multi-level term, with einspect ~ log(x)*family and log(x) = log(y) + a, where \(x/y = \textrm{mutability}\) (this is based on rewriting \(\log(a/b) = \alpha\) into \(\log(a) = \alpha + \log(b)\)); or the approach followed in this paper.
Prior checks:
We fit model \(m_6\).
## Start sampling
## Running MCMC with 4 chains, at most 8 in parallel...
##
## Chain 3 finished in 29.2 seconds.
## Chain 4 finished in 30.4 seconds.
## Chain 2 finished in 31.7 seconds.
## Chain 1 finished in 33.2 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 31.1 seconds.
## Total execution time: 33.4 seconds.
Diagnostics:
## [1] 0
## [1] 1.003259
## [1] 0.4447736
Posterior checks:
Everything is A-OK now.
Let’s compare the models \(m_4\) and \(m_6\) using LOO.
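A sketch of the comparison, assuming the fitted objects are `m4` and `m6`:

```r
# Sketch: Pareto-smoothed importance-sampling LOO for both models,
# then a comparison on the elpd scale.
loo.m4 <- loo(m4)
loo.m6 <- loo(m6)
loo_compare(loo.m4, loo.m6)
```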
## Output of model 'm4':
##
## Computed from 4000 by 945 log-likelihood matrix
##
## Estimate SE
## elpd_loo -1143.7 57.6
## p_loo 62.6 9.7
## looic 2287.5 115.2
## ------
## Monte Carlo SE of elpd_loo is NA.
##
## Pareto k diagnostic values:
## Count Pct. Min. n_eff
## (-Inf, 0.5] (good) 938 99.3% 286
## (0.5, 0.7] (ok) 6 0.6% 155
## (0.7, 1] (bad) 1 0.1% 45
## (1, Inf) (very bad) 0 0.0% <NA>
## See help('pareto-k-diagnostic') for details.
##
## Output of model 'm6':
##
## Computed from 4000 by 945 log-likelihood matrix
##
## Estimate SE
## elpd_loo -1155.0 62.4
## p_loo 52.5 8.7
## looic 2310.0 124.8
## ------
## Monte Carlo SE of elpd_loo is 0.2.
##
## Pareto k diagnostic values:
## Count Pct. Min. n_eff
## (-Inf, 0.5] (good) 940 99.5% 404
## (0.5, 0.7] (ok) 5 0.5% 118
## (0.7, 1] (bad) 0 0.0% <NA>
## (1, Inf) (very bad) 0 0.0% <NA>
##
## All Pareto k estimates are ok (k < 0.7).
## See help('pareto-k-diagnostic') for details.
##
## Model comparisons:
## elpd_diff se_diff
## m4 0.0 0.0
## m6 -11.3 18.3
\(m_6\) and \(m_4\) are very close in terms of predictive capabilities.
## Family: gaussian
## Links: mu = log; sigma = identity
## Formula: einspectS ~ 1 + (1 | p | family) + (1 | q | category) + predicate * family + crashing * family + logmutability * family
## Data: by.statement (Number of observations: 945)
## Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup draws = 4000
##
## Group-Level Effects:
## ~category (Number of levels: 4)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 0.71 0.12 0.49 0.97 1.00 3682 3030
##
## ~family (Number of levels: 4)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 0.48 0.18 0.14 0.84 1.00 2042 2078
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS
## Intercept -2.39 0.63 -3.63 -1.16 1.00 2389
## predicateTRUE -0.26 0.49 -1.22 0.69 1.00 2375
## familyPS 1.60 0.67 0.17 2.80 1.00 2430
## familyST 1.90 0.66 0.48 3.12 1.00 2150
## familySBFL -1.28 0.71 -2.67 0.12 1.00 3825
## crashingTRUE -1.19 0.53 -2.25 -0.19 1.00 2228
## logmutability -0.53 0.35 -1.28 0.11 1.00 1944
## predicateTRUE:familyPS 0.06 0.50 -0.94 1.03 1.00 2371
## predicateTRUE:familyST 0.58 0.50 -0.38 1.55 1.00 2406
## predicateTRUE:familySBFL -0.08 0.85 -1.78 1.54 1.00 4831
## familyPS:crashingTRUE 1.06 0.54 0.05 2.15 1.00 2270
## familyST:crashingTRUE -1.79 0.68 -3.16 -0.49 1.00 3195
## familySBFL:crashingTRUE 0.01 0.84 -1.68 1.61 1.00 4947
## familyPS:logmutability 0.63 0.35 -0.00 1.36 1.00 1947
## familyST:logmutability 0.52 0.35 -0.13 1.26 1.00 1954
## familySBFL:logmutability -0.11 0.63 -1.46 0.99 1.00 3366
## Tail_ESS
## Intercept 2811
## predicateTRUE 2740
## familyPS 2606
## familyST 2775
## familySBFL 2990
## crashingTRUE 2284
## logmutability 2136
## predicateTRUE:familyPS 2542
## predicateTRUE:familyST 2562
## predicateTRUE:familySBFL 3103
## familyPS:crashingTRUE 2604
## familyST:crashingTRUE 2819
## familySBFL:crashingTRUE 3009
## familyPS:logmutability 2147
## familyST:logmutability 2130
## familySBFL:logmutability 2465
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sigma 0.80 0.02 0.76 0.83 1.00 4795 3099
##
## Draws were sampled using sample(hmc). For each parameter, Bulk_ESS
## and Tail_ESS are effective sample size measures, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
## $ints
## logmutability MBFL logmutability PS logmutability ST logmutability SBFL
## |0.5 -0.75474450 0.375940500 0.266247750 -0.5156230
## |0.7 -0.88416690 0.254462550 0.147502000 -0.7639471
## |0.8 -0.97204130 0.182238400 0.078897560 -0.9411668
## |0.87 -1.05968925 0.115193315 0.008621432 -1.1116126
## |0.9 -1.12050650 0.085546755 -0.027511595 -1.2257940
## |0.95 -1.28001225 -0.004312041 -0.130285675 -1.4553477
## |0.99 -1.47134080 -0.178970675 -0.261238665 -1.8345544
## 0.99| 0.25625661 1.564178850 1.469844800 1.3301171
## 0.95| 0.10875320 1.362056500 1.258613250 0.9929937
## 0.9| 0.01516966 1.221132000 1.107307500 0.8401534
## 0.87| -0.02294809 1.163467750 1.050067700 0.7640344
## 0.8| -0.09323496 1.068632000 0.955611000 0.6521943
## 0.7| -0.16356845 0.982931800 0.874557150 0.5276751
## 0.5| -0.27993625 0.855096250 0.749353000 0.3318090
##
## $est
## logmutability MBFL logmutability PS logmutability ST logmutability SBFL
## -0.5291964 0.6255506 0.5170813 -0.1092753
There is a weak tendency for MBFL to do better on mutable bugs, but it can only be detected with 87% confidence (which is still decent). Incidentally, PS (and, to a lesser degree, ST) tends to perform worse on the same kinds of bugs, whereas SBFL is agnostic.
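The “87%” figure is just the posterior probability that the effect has its estimated sign, i.e., the largest one-sided level at which the credible interval still excludes zero. A self-contained sketch, with simulated draws standing in for the real posterior of the MBFL logmutability effect:

```r
# Sketch: the largest one-sided credibility level at which an effect's
# interval excludes zero equals the posterior probability of its estimated sign.
sign.confidence <- function(draws) max(mean(draws < 0), mean(draws > 0))

set.seed(7)
draws <- rnorm(4000, mean = -0.53, sd = 0.47)  # stand-in values only
sign.confidence(draws)
```

With draws matching the estimate above, this evaluates to roughly 0.87, which is why the table includes the extra `|0.87` and `0.87|` rows.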
Finally, let’s also collect the varying-intercept estimates and intervals for the group-level terms family and category. In \(m_6\) these now correspond to the effects on bugs that are in none of the special categories (crashing, predicate, mutable); since this is a relatively small set, we don’t expect any very strong tendency (simply because the data is limited).
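One way to extract per-group intercept summaries from the fit is brms’s `coef()`, which adds each group’s deviation to the population-level intercept (a sketch only; the exact linear combination behind the tables below is not shown in the text):

```r
# Sketch: posterior summaries of group-specific intercepts in m6.
coef(m6)$family[, , "Intercept"]    # one row per family (MBFL, PS, ST, SBFL)
coef(m6)$category[, , "Intercept"]  # one row per category (CL, DEV, DS, WEB)
```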
## $ints
## MBFL PS ST SBFL
## |0.5 -1.09837750 -0.03866348 0.008106222 -1.03776500
## |0.7 -1.31961950 -0.16916845 -0.123050900 -1.32181800
## |0.9 -1.79294900 -0.45615560 -0.354203300 -1.84372150
## |0.95 -2.05672725 -0.59820110 -0.500310250 -2.13864925
## |0.99 -2.59430055 -0.91827276 -0.792044045 -2.81509720
## 0.99| 0.31546326 1.81253300 1.835731450 0.50949495
## 0.95| 0.12512723 1.29965575 1.372258250 0.24451078
## 0.9| 0.03854548 1.09694800 1.147540000 0.13960190
## 0.7| -0.16430215 0.72556065 0.774333900 -0.05992452
## 0.5| -0.31820900 0.52408350 0.571569000 -0.19567100
##
## $est
## MBFL PS ST SBFL
## -0.7456581 0.2577865 0.3136117 -0.6669348
## $ints
## CL DEV DS WEB
## |0.5 0.60921450 -1.9213750 0.2900920 -1.8046375
## |0.7 0.47578020 -2.1417310 0.1576821 -2.0088575
## |0.9 0.21938015 -2.5400950 -0.1101333 -2.3996800
## |0.95 0.02653933 -2.7496218 -0.2652119 -2.6116985
## |0.99 -0.29597590 -3.1878640 -0.6054692 -2.9940812
## 0.99| 1.81451660 -0.4967978 1.4707381 -0.2732759
## 0.95| 1.56997425 -0.7112183 1.2597173 -0.5808576
## 0.9| 1.45210450 -0.8386855 1.1383780 -0.7016571
## 0.7| 1.23746050 -1.1045085 0.9314273 -0.9619804
## 0.5| 1.11337000 -1.2482250 0.7968602 -1.1282750
##
## $est
## CL DEV DS WEB
## 0.8508594 -1.6101808 0.5346215 -1.4848656
Let’s prepare and print some plots of the overall results for model \(m_6\).
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Saving 7 x 5 in image
## [1] "paper/m2-family.pdf"
## Saving 7 x 5 in image
## [1] "paper/m2-category.pdf"
## Saving 7 x 5 in image
## [1] "paper/m6-crashing.pdf"
## Saving 7 x 5 in image
## [1] "paper/m6-predicate.pdf"
## Saving 7 x 5 in image
## [1] "paper/m6-mutable.pdf"